Feature Engineering
Feature engineering or feature extraction or feature discovery is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process.


Process

The feature engineering process is:
* Brainstorming or testing features
* Deciding what features to create
* Creating features
* Testing the impact of the identified features on the task
* Improving the features if needed
* Repeating the process


Typical engineered features

The following list provides some typical ways to engineer useful features (a short code sketch follows this list):
* Numerical transformations (like taking fractions or scaling)
* Category encoders such as one-hot or target encoding (for categorical data)
* Clustering
* Group aggregated values
* Principal component analysis (for numerical data)
* Feature construction: building new "physical", knowledge-based parameters relevant to the problem. For example, in physics, construction of dimensionless numbers such as the Reynolds number in fluid dynamics, the Nusselt number in heat transfer, and the Archimedes number in sedimentation; or construction of first approximations of the solution, such as analytical strength-of-materials solutions in mechanics.
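
As an illustration of a few of the transformations above, here is a minimal sketch using pandas and scikit-learn. The column names and the small example frame are invented for demonstration only.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.decomposition import PCA

    # Invented example data: two numerical columns and one categorical column.
    df = pd.DataFrame({
        "height_cm": [150.0, 162.5, 181.0, 175.5],
        "weight_kg": [55.0, 61.2, 90.4, 78.3],
        "color": ["red", "blue", "red", "green"],
    })

    # Numerical transformation: derive a ratio feature, then scale to zero mean / unit variance.
    df["weight_per_height"] = df["weight_kg"] / df["height_cm"]
    scaled = StandardScaler().fit_transform(df[["height_cm", "weight_kg", "weight_per_height"]])

    # Category encoding: one-hot encode the categorical column.
    onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()

    # Principal component analysis: project the scaled numerical features onto 2 components.
    components = PCA(n_components=2).fit_transform(scaled)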


Relevance

Features vary in significance. Even relatively insignificant features may contribute to a model.
Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting).
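
As an example, here is a minimal scikit-learn sketch of filter-based feature selection; the dataset and the choice of k are illustrative only.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)

    # Keep only the 2 features with the highest ANOVA F-score against the target.
    selector = SelectKBest(score_func=f_classif, k=2)
    X_selected = selector.fit_transform(X, y)

    print(X.shape, "->", X_selected.shape)   # (150, 4) -> (150, 2)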


Explosion

Feature explosion occurs when the number of identified features grows inappropriately. Common causes include:
* Feature templates - implementing feature templates instead of coding new features
* Feature combinations - combinations that cannot be represented by a linear system

Feature explosion can be limited via techniques such as regularization, kernel methods, and feature selection.
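
As one example of such a technique, L1 regularization drives the coefficients of uninformative features to exactly zero. The following is a minimal scikit-learn sketch; the synthetic data and the regularization strength are illustrative.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)

    # 100 samples, 50 candidate features, but only the first 3 actually drive the target.
    X = rng.normal(size=(100, 50))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

    # L1 regularization (Lasso) zeroes out most of the irrelevant coefficients.
    model = Lasso(alpha=0.1).fit(X, y)
    print("non-zero coefficients:", np.sum(model.coef_ != 0))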


Automation

Automation of feature engineering is a research topic that dates back to the 1990s. Machine learning software that incorporates automated feature engineering has been commercially available since 2016. Related academic literature can be roughly separated into two types:
* Multi-relational decision tree learning (MRDTL) uses a supervised algorithm that is similar to a decision tree.
* Deep Feature Synthesis uses simpler methods.


Multi-relational decision tree learning (MRDTL)

MRDTL generates features in the form of SQL queries by successively adding clauses to the queries. For instance, the algorithm might start out with

    SELECT COUNT(*) FROM ATOM t1 LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id GROUP BY t1.mol_id

The query can then successively be refined by adding conditions, such as "WHERE t1.charge <= -0.392". However, most MRDTL studies base their implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by using techniques such as tuple id propagation, and efficiency can be further increased by using incremental updates.
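
The successive-refinement idea can be caricatured as appending candidate clauses to a base query. The toy Python sketch below is not the MRDTL implementation; the second candidate condition is invented purely for illustration.

    # Toy illustration (not MRDTL itself): refine a base query by appending candidate WHERE clauses.
    base_query = (
        "SELECT COUNT(*) FROM ATOM t1 "
        "LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id "
        "GROUP BY t1.mol_id"
    )

    candidate_conditions = ["t1.charge <= -0.392", "t2.weight > 100"]

    def refine(query, condition):
        # Insert the condition before GROUP BY, using WHERE or AND as appropriate.
        head, group_by = query.split(" GROUP BY ")
        keyword = "AND" if " WHERE " in head else "WHERE"
        return f"{head} {keyword} {condition} GROUP BY {group_by}"

    query = base_query
    for condition in candidate_conditions:
        query = refine(query, condition)
        print(query)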


Open-source implementations

There are a number of open-source libraries and tools that automate feature engineering on relational data and time series:
* featuretools is a Python library for transforming time series and relational data into feature matrices for machine learning (see the sketch after this list).
* OneBM or One-Button Machine combines feature transformations and feature selection on relational data with feature selection techniques.
* getML community is an open source tool for automated feature engineering on time series and relational data. It is implemented in C/C++ with a Python interface. It has been shown to be at least 60 times faster than tsflex, tsfresh, tsfel, featuretools or kats.
* tsfresh is a Python library for feature extraction on time series data. It evaluates the quality of the features using hypothesis testing.
* tsflex is an open source Python library for extracting features from time series data. Despite being 100% written in Python, it has been shown to be faster and more memory efficient than tsfresh, seglearn or tsfel.
* seglearn is an extension of the scikit-learn Python library for multivariate, sequential time series data.
* tsfel is a Python package for feature extraction on time series data.
* kats is a Python toolkit for analyzing time series data.
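
As an example of the kind of API these tools expose, here is a minimal featuretools sketch. The dataframe names, the relationship, and the exact call signatures are assumptions that depend on the featuretools version in use (the 1.x API is assumed).

    import pandas as pd
    import featuretools as ft

    # Invented example: customers and their transactions.
    customers = pd.DataFrame({"customer_id": [1, 2], "join_year": [2019, 2021]})
    transactions = pd.DataFrame({
        "transaction_id": [10, 11, 12],
        "customer_id": [1, 1, 2],
        "amount": [25.0, 40.0, 15.0],
    })

    # Build an EntitySet describing the tables and their relationship.
    es = ft.EntitySet(id="shop")
    es = es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
    es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions, index="transaction_id")
    es = es.add_relationship("customers", "customer_id", "transactions", "customer_id")

    # Deep Feature Synthesis: automatically build aggregate features such as SUM(transactions.amount).
    feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")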


Deep feature synthesis

The deep feature synthesis (DFS) algorithm beat 615 of 906 human teams in a competition.


Feature stores

A feature store is where features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model). It is a central location where one can create or update groups of features built from multiple different data sources, or create and update new datasets from those feature groups for training models or for use in applications that do not want to compute the features themselves but simply retrieve them when they are needed to make predictions. A feature store includes the ability to store the code used to generate features, apply that code to raw data, and serve the resulting features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used. Feature stores can be standalone software tools or built into machine learning platforms.
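
The following is a toy, in-memory sketch of the core idea: register feature-generating code once, then materialize and serve the features on request. It is an illustration only; real feature stores expose far richer, persistent APIs.

    import pandas as pd

    class ToyFeatureStore:
        """Toy in-memory feature store: keeps the code that generates each feature group."""

        def __init__(self):
            self._generators = {}   # group name -> (version, function producing a DataFrame)

        def register(self, group, version, generator):
            # Store the feature-generating code so it can be re-applied to fresh raw data later.
            self._generators[group] = (version, generator)

        def serve(self, group, raw_data):
            # Apply the stored code to raw data and return the features for training or prediction.
            version, generator = self._generators[group]
            features = generator(raw_data)
            features["feature_version"] = version
            return features

    # Usage: register a feature group once, then serve it to a model on request.
    store = ToyFeatureStore()
    store.register("orders", version=1,
                   generator=lambda df: df.groupby("customer_id")["amount"].sum().to_frame("total_spent"))

    raw = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [25.0, 40.0, 15.0]})
    print(store.serve("orders", raw))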


See also

* Covariate
* Data transformation
* Feature extraction
* Feature learning
* Hashing trick
* Kernel method
* List of datasets for machine learning research
* Space mapping
* Instrumental variables estimation


References


Further reading

* Zumel, Nina; Mount, John (2020). "Data Engineering and Data Shaping". Practical Data Science with R (2nd ed.). Manning. pp. 113-160. ISBN 978-1-61729-587-4.